The Philadelphia Water Department (PWD), in accordance with a consent agreement with the EPA, is in the midst of one of the most extensive green infrastructure investments in the country. Philadelphia plans to invest $2.4 billion in public funds in the Green City, Clean Waters stormwater infrastructure plan between 2009 and 2035. Of that, $1.68 billion is earmarked for green stormwater infrastructure projects. Thus far, PWD is on track to meet the negotiated goal of constructing 10,000 acre-inches of stormwater retention capacity across the combined sewer service area by the year 2035. However, in order to meet that goal, project costs must be accurately projected and understood. The initial round of infrastructure represents the “low-hanging fruit” projects, those that are the easiest and most immediately cost effective. As these are completed, additional project opportunities will likely become more costly and difficult.
Since grants for private projects do not reflect the actual project construction cost, this study is limited to public projects that are funded primarily by PWD and constructed under the supervision of PWD engineers. The public projects program has 6 program types: schools, parks, parking, vacant land, facilities, and streets. This study groups all parcel projects (schools, parks, parking, vacant land, and facilities) together as public parcel projects and compares them with streets projects. There are several types of stormwater management practices (SMPs), or stormwater interventions, that PWD regularly employs in public projects. SMP types range from pervious pavement to rain gardens to bioswales. However, not all green infrastructure is equally green; the infrastructure exists on a spectrum from grey to green interventions. Click through the diagrams below using the tabs to see examples of green infrastructure interventions and notice the increasing prevalence of grey infrastructure strategies.
The dependent variable in this study is the total cost of an individual project, which can include several SMP types and span several program types. The main reason for modeling cost is to develop a more nuanced understanding of the factors that affect project cost across the city. This understanding can help infrastructure planning both in locating future projects and in identifying the most cost-effective SMP types and program types for a given watershed. Such a cost model would help PWD planners determine the most effective and cost-efficient systems to prioritize.
The second reason for modeling cost is to create a predictive model that can be used to estimate the cost of future projects that have not yet been estimated. Such a model would take planning-level inputs such as project location, design capacity, proposed SMP types, and proposed program types, and return an estimated project cost and an estimated error to help allocate capital improvement budgets more accurately. A top-down cost model would be useful at the planning level because it can be more accurate than basic cost estimates and it avoids the time and cost of a project estimator in the planning process.
The study explores 2 different modeling approaches. The first is a basic multivariate linear model that incorporates the best predictor variables narrowed down from all available variables. This is the base case and represents the most common modeling approach. The process started with a ‘kitchen sink’ linear regression on all of the independent variables, followed by a correlation test to eliminate highly collinear variables and a stepwise regression to identify the variables that have the strongest and most statistically significant relationship with the dependent variable.
The second approach is our proposed alternative, which is tested against the basic multivariate modeling process. This approach is divided into 2 separate models, one for land cover variations and the other for land use variations. The purpose of this split is to determine separately the effects of land cover and land use on cost. Each of these two models then splits the data into 3 multivariate linear models based on the land use or land cover cluster grouping. The purpose is to model cost separately for each cluster, with the hypothesis that similar land use and land cover mixes will yield similar cost trends. The land cover and land use models were then blended back together using a weighted mean blending approach. The purpose of blending the 2 models is to make the final model more robust, since the dataset has only 150 data points. Ensembling multiple models reduces the risk that the model is over-fit to the training data and should therefore make it more generalizable to new data that the model has not yet seen. In this case, the models are tuned and trained on past constructed projects and tested on contracted and in-construction projects in order to see how well they predict on new data.
Watersheds were calculated using ArcHydro tools in ArcMap. The watershed boundaries were used as geographic boundaries of study because sewer systems, which rely of course on gravity, generally follow topography. Therefore, watersheds are the best proxy for the drainage areas of each combined sewer outfall. The resulting pour points from the hydrological modeling nearly perfectly aligned with the combined sewer outfall points, which confirms the notion that sewer systems generally follow surface hydrology (sewer mains are generally where hydrological modeling places streams and sewer main drainage areas are generally similar to surface watershed basins). Because there are 63 watersheds and only 150 public projects, using watersheds as the only geographic fixed effect would result in an over-fit model since there would be so few data points in each watershed. To solve this problem and ensure a generalizable model, we performed a cluster analysis in order to place the 63 watersheds into 3 categories based on commonalities in land cover and land use.
K-means clustering analysis was performed on a 5-dimensional land cover dataset consisting of percent pavement, percent roads, percent buildings, percent tree canopy, and percent pervious cover. An elbow plot helped to determine the optimal number of clusters. In this particular case, 3 clusters is the elbow of the chart: the point where the curve’s steep decline levels off, so that adding another cluster yields only a small further improvement.
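A minimal sketch of how an elbow curve can be computed, assuming a plain Lloyd's-algorithm k-means written in pure Python. The land cover fractions, the function names, and the fixed random seed are illustrative assumptions, not the study's data or tooling.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(group) for col in zip(*group))

def kmeans_inertia(points, k, iters=25, seed=0):
    """Run Lloyd's algorithm; return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # keep the previous center if a cluster ends up empty
        centers = [mean_point(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical watersheds: (pavement, roads, buildings, tree canopy, pervious)
watersheds = [
    (0.30, 0.25, 0.35, 0.05, 0.05),
    (0.28, 0.27, 0.33, 0.06, 0.06),
    (0.20, 0.20, 0.20, 0.20, 0.20),
    (0.22, 0.18, 0.21, 0.19, 0.20),
    (0.05, 0.10, 0.10, 0.35, 0.40),
    (0.06, 0.09, 0.12, 0.33, 0.40),
]

# One value per candidate k; the "elbow" is where the drop flattens out
elbow = {k: kmeans_inertia(watersheds, k) for k in range(1, 5)}
for k, wss in elbow.items():
    print(k, round(wss, 4))
```

Plotting these values against k gives the elbow chart; the choice of k is still a judgment call, as the discussion of cluster sizes in this section shows.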
The plot below visualizes the clustered data points. It is important to note that this is merely a 2-dimensional representation of a 5-dimensional clustering. Nonetheless, the grouping of the data points is clear. The data could perhaps benefit from an additional cluster on the right side of the chart, but that would create a cluster with too few data points for any statistically meaningful analysis and insight.
The map below shows the final cluster groupings on the map. If you take a moment to zoom in on various clusters, you will notice clear distinctions in the built environments between the 3 clusters. Click on each part of the map to see in which cluster it is categorized.
The plot below shows the average percent land cover composition of each cluster. Hover over each part of the chart to see which land cover types each cluster is associated with. Cluster 2 has the highest pervious cover, whereas cluster 3 has the lowest pervious cover and the highest building cover percentage.
K-means clustering analysis was performed on an 8-dimensional dataset of the following land use groupings: industrial, retail, transportation (roads), vacant, residential, open space, civic & cultural, and office & mixed use. The same elbow plot process helped to determine the optimal number of clusters. In this case it initially appeared that 5 clusters would be optimal, but upon analyzing the resulting cluster map, it became apparent that 5 clusters would leave at least 1 of the clusters with too few projects from which to draw any meaningful or statistically significant conclusions. As a result, we settled on 3 clusters in order to distribute the data points more evenly across clusters.
Using 5 clusters leaves clusters 4 and 5 with only a few watersheds, and hence only a few project data points fall within those clusters. This cluster configuration also places too many watersheds and project points into cluster 2, which is counterproductive to the intent of the cluster analysis.
It is visually clear that 3 clusters gives a better distribution of watersheds. Zooming in on various mapped clusters, it is apparent that the built environment is distinct for each of the 3 cluster categories.
The plot below shows the average percent land use composition of each cluster. Hover over each part of the chart to see which land use types each cluster is associated with. Cluster 3 is medium- and low-density residential neighborhoods, cluster 2 is primarily industrial areas and Center City, and cluster 1 is primarily highway rights-of-way.
All of the infrastructure data was obtained from PWD’s 2018 annual update report under the National Pollutant Discharge Elimination System (NPDES). The original report can be found here: http://phillywatersheds.org/doc/FY18_MS4_CSO_withAppendices.pdf
The data was copied from the report, joined to geographic data from PWD obtained via OpenDataPhilly, and processed. All streets projects that crossed watershed boundaries were split at the watershed boundary, and the total project cost, capacity, and SMPs were apportioned to each segment in proportion to the segment’s share of the project’s total length. This enabled a much more accurate accounting of project metrics relative to their geographic location. All of the input data is listed in the data dictionary below. All project costs were adjusted to 2018 dollars using the regional construction price index obtained from the US Census Bureau. This ensures that all project costs are compared on an equal basis, so that the model doesn’t interpret older projects as being cheaper simply because of subsequent inflation.
| Variable | Description | Source |
|---|---|---|
| PRIMARYPROGRAMNAME | PWD GCCW Program Name | 2018 NPDES Report |
| SMP_TREETRENCH | Has Tree Trench | From PWD via OpenDataPhilly.com |
| SMP_PLANTER | Has Planter | From PWD via OpenDataPhilly.com |
| SMP_BUMPOUT | Has Bumpout | From PWD via OpenDataPhilly.com |
| SMP_RAINGARDEN | Has Rain Garden | From PWD via OpenDataPhilly.com |
| SMP_BASIN | Has Basin | From PWD via OpenDataPhilly.com |
| SMP_INFILTRATIONSTORAGETRENCH | Has Infiltration Storage Trench | From PWD via OpenDataPhilly.com |
| SMP_PERVIOUSPAVING | Has Pervious Paving | From PWD via OpenDataPhilly.com |
| SMP_SWALE | Has Bio Swale | From PWD via OpenDataPhilly.com |
| SMP_WETLAND | Has Constructed Wetland | From PWD via OpenDataPhilly.com |
| SMP_GREENROOF | Has Green Roof | From PWD via OpenDataPhilly.com |
| SMP_OTHER | Has Other SMP Type | From PWD via OpenDataPhilly.com |
| SMP_STORMWATERTREE | Has Stormwater Tree | From PWD via OpenDataPhilly.com |
| SMP_DRAINAGEWELL | Has Drainage Well | From PWD via OpenDataPhilly.com |
| SMP_GREENGUTTER | Has Green Gutter | From PWD via OpenDataPhilly.com |
| SMP_BLUEROOF | Has Blue Roof | From PWD via OpenDataPhilly.com |
| SMP_DEPAVING | Has Depaving | From PWD via OpenDataPhilly.com |
| SMP_INFILTRATIONCOLUMNS | Has Infiltration Columns | From PWD via OpenDataPhilly.com |
| TREES | Number of Trees Planted | From PWD via OpenDataPhilly.com |
| GREENED_ACRES | Project Design Capacity (acre-inches) | 2018 NPDES Report |
| COST | Project Cost | 2018 NPDES Report |
| year | Year of Construction Completion | 2018 NPDES Report |
| landCover_cluster | Land Cover Cluster Number | Calculated from 1m resolution land cover dataset from Delaware Valley Regional Planning Commission website |
| landUse_cluster | Land Use Cluster Number | Calculated from Philadelphia City Planning Commission zoning map in OpenDataPhilly.com |
| land_use | Land Use of Project Site | Philadelphia City Planning Commission zoning map in OpenDataPhilly.com |
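The length-based apportionment of streets projects and the adjustment to 2018 dollars described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the watershed labels, the segment lengths, and the index values are all hypothetical, and only the `COST` and `GREENED_ACRES` fields from the data dictionary are shown.

```python
def apportion_by_length(project, segment_lengths):
    """Split a project's totals across watersheds by share of street length."""
    total_length = sum(segment_lengths.values())
    pieces = {}
    for watershed, length in segment_lengths.items():
        share = length / total_length
        pieces[watershed] = {
            "COST": project["COST"] * share,
            "GREENED_ACRES": project["GREENED_ACRES"] * share,
        }
    return pieces

def to_2018_dollars(cost, index_in_year, index_2018):
    """Rescale a nominal cost by the ratio of two price index values."""
    return cost * index_2018 / index_in_year

# Hypothetical streets project crossing two watersheds (lengths in feet)
project = {"COST": 400_000.0, "GREENED_ACRES": 2.0}
pieces = apportion_by_length(project, {"WS_A": 300.0, "WS_B": 100.0})

# Hypothetical index values; real ones would come from the price index series
adjusted = to_2018_dollars(pieces["WS_A"]["COST"],
                           index_in_year=100.0, index_2018=110.0)
print(pieces, adjusted)
```

Apportioning by length keeps the segment totals equal to the original project totals, so no cost or capacity is created or lost by the split.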
This first modeling process started with a ‘kitchen sink’ model, which reveals which variables are statistically significant in explaining the dependent variable. From the initial ‘kitchen sink’ model, forward and backward stepwise regression functions were run in order to guide variable selection. The combination of statistically significant variables with the lowest Akaike information criterion (AIC) was selected for the final model. The results of both models can be seen in the table below. In total, 15 variables were dropped while the R2 fell only from 0.84 to 0.83.
| Dependent variable: COST | Kitchen Sink (1) | Leanest (2) |
|---|---|---|
| PRIMARYPROGRAMNAMESchools | -167,043.200 | |
| | (128,385.100) | |
| PRIMARYPROGRAMNAMEStreets | 60,078.610 | |
| | (170,805.800) | |
| PRIMARYPROGRAMNAMEVacant Land | -112,319.100 | |
| | (96,149.950) | |
| SMP_TREETRENCH | 47,581.020*** | 38,658.490*** |
| | (10,006.520) | (7,900.184) |
| SMP_PLANTER | 15,097.500 | |
| | (14,929.270) | |
| SMP_BUMPOUT | 64,295.980** | 67,731.950** |
| | (27,876.620) | (26,979.450) |
| SMP_RAINGARDEN | 31,550.930** | 39,117.140*** |
| | (12,579.790) | (11,210.790) |
| SMP_BASIN | 10,835.270 | |
| | (140,067.200) | |
| SMP_INFILTRATIONSTORAGETRENCH | 16,518.970* | 16,697.800* |
| | (9,747.491) | (8,635.705) |
| SMP_PERVIOUSPAVING | -4,635.609 | |
| | (37,074.020) | |
| SMP_SWALE | 98,930.380*** | 92,183.320*** |
| | (31,358.720) | (29,477.680) |
| SMP_WETLAND | | |
| SMP_GREENROOF | | |
| SMP_STORMWATERTREE | 14,408.580*** | 12,521.620*** |
| | (2,714.358) | (2,479.046) |
| SMP_DRAINAGEWELL | | |
| SMP_GREENGUTTER | | |
| SMP_BLUEROOF | | |
| SMP_DEPAVING | -11,361.180 | |
| | (27,188.830) | |
| TREES | -1,909.564 | |
| | (1,274.589) | |
| GREENED_ACRES | 132,030.100*** | 127,258.900*** |
| | (12,129.400) | (8,855.484) |
| landCover_cluster | -32,839.620** | -29,556.730* |
| | (16,183.230) | (15,104.110) |
| landUse_cluster | 3,673.245 | |
| | (15,557.040) | |
| land_useCulture_Recreation | -71,504.820 | |
| | (92,793.910) | |
| land_useIndustrial | 157,179.000 | |
| | (152,742.100) | |
| land_usePark_openSpace | 2,615.963 | |
| | (86,917.210) | |
| land_useTransportation | -162,452.500 | |
| | (182,523.500) | |
| land_useVacant | | |
| Constant | 193,772.200** | 102,043.100*** |
| | (93,034.720) | (32,268.290) |
| Observations | 149 | 149 |
| R2 | 0.838 | 0.825 |
| Adjusted R2 | 0.811 | 0.815 |
| Residual Std. Error | 128,115.600 (df = 127) | 126,576.400 (df = 140) |
| F Statistic | 31.206*** (df = 21; 127) | 82.683*** (df = 8; 140) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
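As a rough check on the variable selection, the AIC of the two models can be approximated from the figures reported in the table above. This sketch assumes the common least-squares form of AIC (defined up to an additive constant) and recovers the residual sum of squares from the residual standard error and its degrees of freedom; the function name is hypothetical, and the result is not the output of the stepwise procedure itself.

```python
import math

def aic_least_squares(n_obs, resid_std_error, df_resid):
    """AIC up to an additive constant: n * ln(RSS / n) + 2 * k."""
    rss = resid_std_error ** 2 * df_resid   # RSS recovered from the RSE
    n_params = n_obs - df_resid             # coefficients incl. the constant
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# Inputs taken from the Observations and Residual Std. Error rows
aic_kitchen_sink = aic_least_squares(149, 128_115.6, 127)
aic_leanest = aic_least_squares(149, 126_576.4, 140)
print(round(aic_kitchen_sink, 1), round(aic_leanest, 1))
```

Under this approximation the leanest model has the lower AIC: its fit is slightly worse, but the penalty for its far smaller number of parameters more than compensates.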
The data was broken up into 3 sets based on the land cover cluster in which each project falls. ‘Kitchen sink’ models were run on all three groupings including all independent variables.
| Dependent variable: COST | Cluster 1 (1) | Cluster 2 (2) | Cluster 3 (3) |
|---|---|---|---|
| PRIMARYPROGRAMNAMESchools | -146,898.800 | -196,078.100 | |
| (175,454.400) | (137,989.400) | ||
| PRIMARYPROGRAMNAMEStreets | 271,358.100 | 47,706.320 | 115,168.900 |
| (422,207.600) | (87,503.050) | (220,841.800) | |
| PRIMARYPROGRAMNAMEVacant Land | 14,155.910 | ||
| (161,387.700) | |||
| SMP_TREETRENCH | 52,518.250** | 20,012.950 | 48,522.080*** |
| (22,394.100) | (15,759.890) | (15,579.650) | |
| SMP_PLANTER | 6,050.937 | 50,366.460*** | -69,202.210 |
| (14,784.350) | (17,468.150) | (76,639.610) | |
| SMP_BUMPOUT | 26,100.120 | 91,132.570** | 89,832.770 |
| (58,083.970) | (40,643.830) | (53,760.830) | |
| SMP_RAINGARDEN | 100,002.800 | 20,086.340 | 42,281.450* |
| (81,571.440) | (24,672.270) | (23,200.500) | |
| SMP_BASIN | 65,715.200 | ||
| (168,589.000) | |||
| SMP_INFILTRATIONSTORAGETRENCH | -34,014.760 | 19,868.440 | 27,685.620 |
| (23,805.510) | (15,676.560) | (16,327.440) | |
| SMP_PERVIOUSPAVING | -8,052.243 | -22,958.930 | -14,057.800 |
| (64,735.550) | (43,321.440) | (127,461.500) | |
| SMP_SWALE | 645,328.100* | 358,773.200*** | -50,245.720 |
| (358,424.700) | (51,152.860) | (46,198.420) | |
| SMP_WETLAND | |||
| SMP_GREENROOF | |||
| SMP_STORMWATERTREE | 5,050.887 | 17,850.230*** | |
| (4,205.043) | (3,612.364) | ||
| SMP_DRAINAGEWELL | |||
| SMP_GREENGUTTER | |||
| SMP_BLUEROOF | |||
| SMP_DEPAVING | -19,614.870 | ||
| (18,467.490) | |||
| TREES | -2,268.615 | -652.798 | -2,773.313 |
| (2,033.548) | (2,023.536) | (2,360.300) | |
| GREENED_ACRES | 182,193.500*** | 162,117.900*** | 131,521.400*** |
| (32,717.730) | (24,497.070) | (18,105.390) | |
| landCover_cluster | |||
| landUse_cluster | -15,236.090 | 1,845.125 | 16,238.840 |
| (21,060.200) | (22,916.460) | (46,961.460) | |
| land_useCulture_Recreation | 373,361.100 | 63,480.870 | |
| (360,434.100) | (93,975.230) | ||
| land_useIndustrial | 124,165.600 | ||
| (241,760.800) | |||
| land_usePark_openSpace | 82,724.700 | 62,993.100 | |
| (111,623.300) | (148,391.800) | ||
| land_useTransportation | -79,645.790 | ||
| (248,653.700) | |||
| land_useVacant | 15,958.520 | ||
| (218,541.500) | |||
| Constant | -220,688.900 | -17,508.030 | -57,293.520 |
| (424,329.600) | (99,019.260) | (144,679.500) | |
| Observations | 38 | 71 | 40 |
| R2 | 0.913 | 0.832 | 0.948 |
| Adjusted R2 | 0.861 | 0.786 | 0.904 |
| Residual Std. Error | 74,815.910 (df = 23) | 101,861.100 (df = 55) | 136,057.400 (df = 21) |
| F Statistic | 17.308*** (df = 14; 23) | 18.157*** (df = 15; 55) | 21.430*** (df = 18; 21) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
The statistically significant independent variables were selected and used to run the final land cover cluster models. The results of the three land cover cluster models can be seen in the table below.
| Dependent variable: COST | Cluster 1 (1) | Cluster 2 (2) | Cluster 3 (3) |
|---|---|---|---|
| GREENED_ACRES | 180,966.200*** | 178,614.900*** | 130,029.100*** |
| (17,736.440) | (14,164.930) | (10,153.360) | |
| SMP_PLANTER | 56,800.370*** | ||
| (16,599.870) | |||
| SMP_BUMPOUT | 93,358.670** | ||
| (40,422.380) | |||
| SMP_SWALE | 302,550.900*** | 334,438.500*** | |
| (71,737.510) | (47,608.540) | ||
| SMP_TREETRENCH | 26,529.970* | 32,732.490*** | |
| (15,368.520) | (10,607.820) | ||
| SMP_RAINGARDEN | 18,654.580 | ||
| (14,083.110) | |||
| SMP_INFILTRATIONSTORAGETRENCH | 40,035.840*** | ||
| (10,845.920) | |||
| SMP_STORMWATERTREE | 15,113.380*** | ||
| (3,267.196) | |||
| Constant | 33,359.770 | 56,133.790** | 27,252.900 |
| (30,438.510) | (22,155.390) | (32,164.410) | |
| Observations | 38 | 71 | 40 |
| R2 | 0.794 | 0.790 | 0.920 |
| Adjusted R2 | 0.776 | 0.778 | 0.909 |
| Residual Std. Error | 94,833.570 (df = 34) | 103,866.500 (df = 66) | 132,876.900 (df = 34) |
| F Statistic | 43.708*** (df = 3; 34) | 62.208*** (df = 4; 66) | 78.488*** (df = 5; 34) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
The data was broken up into 3 sets based on the land use cluster in which each project falls. ‘Kitchen sink’ models were run on all three groupings including all independent variables.
| Dependent variable: COST | Cluster 1 (1) | Cluster 2 (2) | Cluster 3 (3) |
|---|---|---|---|
| PRIMARYPROGRAMNAMESchools | -203,754.700 | 295,698.100 | |
| (151,801.200) | (204,454.000) | ||
| PRIMARYPROGRAMNAMEStreets | -12,508.360 | 69,624.460 | 110,835.500 |
| (107,328.000) | (91,362.780) | (222,430.800) | |
| PRIMARYPROGRAMNAMEVacant Land | -178,194.300 | 58,886.030 | |
| (131,382.900) | (107,829.800) | ||
| SMP_TREETRENCH | -28,764.340 | 13,058.150 | 50,878.330*** |
| (54,112.760) | (18,725.480) | (8,959.121) | |
| SMP_PLANTER | 168,031.500 | 91,675.180*** | -4,378.223 |
| (344,401.700) | (24,221.560) | (11,745.840) | |
| SMP_BUMPOUT | 98,743.900** | 79,039.830*** | |
| (43,707.120) | (25,047.400) | ||
| SMP_RAINGARDEN | 25,346.600 | 26,660.070 | 65,010.430** |
| (36,919.240) | (29,122.890) | (31,735.860) | |
| SMP_BASIN | 152,485.300 | ||
| (125,580.600) | |||
| SMP_INFILTRATIONSTORAGETRENCH | -64,969.750* | 12,734.420 | 43,969.580*** |
| (35,080.840) | (18,968.460) | (9,512.063) | |
| SMP_PERVIOUSPAVING | -68,325.510 | 52,955.350 | -35,216.600 |
| (57,370.580) | (76,144.830) | (44,116.240) | |
| SMP_SWALE | 349,056.000*** | -76,517.850** | |
| (50,497.260) | (36,962.960) | ||
| SMP_WETLAND | |||
| SMP_GREENROOF | |||
| SMP_STORMWATERTREE | 1,587.471 | 16,872.650*** | |
| (7,770.459) | (2,130.245) | ||
| SMP_DRAINAGEWELL | |||
| SMP_GREENGUTTER | |||
| SMP_BLUEROOF | |||
| SMP_DEPAVING | -15,078.590 | ||
| (18,243.300) | |||
| TREES | -1,017.830 | -1,217.883 | -1,609.621 |
| (4,520.068) | (2,746.605) | (1,075.825) | |
| GREENED_ACRES | 178,135.600*** | 186,283.600*** | 123,382.700*** |
| (42,188.600) | (31,675.580) | (12,553.830) | |
| landCover_cluster | -36,408.190 | -57,841.020 | -21,949.250* |
| (48,561.580) | (60,229.360) | (12,380.440) | |
| landUse_cluster | |||
| land_useCulture_Recreation | 2,438.777 | 66,339.920 | |
| (118,100.200) | (159,033.900) | ||
| land_useIndustrial | 91,338.820 | ||
| (125,918.100) | |||
| land_usePark_openSpace | 76,884.210 | 25,511.520 | |
| (102,349.500) | (126,014.700) | ||
| land_useTransportation | -58,611.550 | ||
| (184,720.500) | |||
| land_useVacant | |||
| Constant | 235,090.700 | 65,632.020 | 14,493.280 |
| (145,378.400) | (149,329.700) | (180,033.600) | |
| Observations | 21 | 51 | 77 |
| R2 | 0.915 | 0.881 | 0.955 |
| Adjusted R2 | 0.811 | 0.835 | 0.938 |
| Residual Std. Error | 101,989.900 (df = 9) | 106,805.100 (df = 36) | 81,737.910 (df = 56) |
| F Statistic | 8.800*** (df = 11; 9) | 19.027*** (df = 14; 36) | 58.802*** (df = 20; 56) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
The statistically significant independent variables were selected and used to run the final land use cluster models. The results of the three land use cluster models can be seen in the table below.
| Dependent variable: COST | Cluster 1 (1) | Cluster 2 (2) | Cluster 3 (3) |
|---|---|---|---|
| GREENED_ACRES | 170,050.900*** | 199,442.800*** | 112,871.800*** |
| (22,287.250) | (24,851.690) | (6,901.850) | |
| SMP_PLANTER | 10,566.800 | ||
| (123,153.900) | |||
| SMP_RAINGARDEN | 51,983.860*** | ||
| (13,935.960) | |||
| SMP_TREETRENCH | 45,462.970*** | ||
| (6,647.728) | |||
| SMP_BUMPOUT | 76,293.650*** | ||
| (24,317.710) | |||
| SMP_SWALE | -59,230.460** | ||
| (25,945.940) | |||
| SMP_STORMWATERTREE | 15,443.530*** | ||
| (2,062.747) | |||
| SMP_INFILTRATIONSTORAGETRENCH | 44.747 | 45,317.380*** | |
| (23,334.100) | (8,231.365) | ||
| Constant | 67,893.530* | 78,401.970* | 30,140.690** |
| (39,028.140) | (40,730.490) | (15,076.330) | |
| Observations | 21 | 51 | 77 |
| R2 | 0.766 | 0.589 | 0.942 |
| Adjusted R2 | 0.740 | 0.572 | 0.936 |
| Residual Std. Error | 119,642.200 (df = 18) | 171,875.000 (df = 48) | 83,177.070 (df = 69) |
| F Statistic | 29.440*** (df = 2; 18) | 34.383*** (df = 2; 48) | 160.110*** (df = 7; 69) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
For both the land use and land cover models, the three cluster models were recombined so that, as additional data is entered, each record is routed to the linear model for its cluster and the resulting predictions are combined and joined back to the input dataset.
The land use and land cover cluster model outputs were combined using a weighted mean ensemble, which takes the average of the models’ fitted predictions for each row. The result is an ensembled prediction that takes into account insights from both land use and land cover. The intent of this modeling approach is to make the final predicted values as generalizable as possible, meaning that the final model is not over-fit to any of the input independent variables. This approach was taken because there are only 150 data points in the data set, and it would be very easy to over-fit a model to so few data. The last reason for the modeling approach is to take into account variations in cost based on geographic location. For example, green infrastructure in Center City is more expensive than green infrastructure in the less dense Northeast Philly, mainly due to differences in land values and land uses but also due to differences in SMP types and capacities across the city. Similarly, SMPs in South Philly tend to be smaller and greyer than those in West Philly due to variations in lot sizes, land uses, and land covers.
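The routing and blending steps can be sketched as follows. This is a minimal illustration, not the study's implementation: the coefficient values are hypothetical placeholders rather than the fitted values from the tables, each model is reduced to a constant and a single `GREENED_ACRES` term, and the function names and the default equal weighting are assumptions.

```python
# Hypothetical per-cluster coefficients; NOT the fitted values from the tables
LAND_COVER_MODELS = {
    1: {"Constant": 30_000.0, "GREENED_ACRES": 180_000.0},
    2: {"Constant": 55_000.0, "GREENED_ACRES": 175_000.0},
    3: {"Constant": 25_000.0, "GREENED_ACRES": 130_000.0},
}
LAND_USE_MODELS = {
    1: {"Constant": 70_000.0, "GREENED_ACRES": 170_000.0},
    2: {"Constant": 80_000.0, "GREENED_ACRES": 200_000.0},
    3: {"Constant": 30_000.0, "GREENED_ACRES": 115_000.0},
}

def predict(models, cluster, features):
    """Route a project to its cluster's linear model and evaluate it."""
    coefs = models[cluster]
    return coefs["Constant"] + sum(
        beta * features.get(name, 0.0)
        for name, beta in coefs.items()
        if name != "Constant"
    )

def blended_cost(project, weight_land_cover=0.5):
    """Weighted mean of the land cover and land use predictions."""
    lc = predict(LAND_COVER_MODELS, project["landCover_cluster"], project)
    lu = predict(LAND_USE_MODELS, project["landUse_cluster"], project)
    return weight_land_cover * lc + (1.0 - weight_land_cover) * lu

# Hypothetical planned project
project = {"GREENED_ACRES": 1.0, "landCover_cluster": 3, "landUse_cluster": 1}
print(blended_cost(project))
```

With a weight of 0.5 this reduces to a simple average of the two predictions; other weights would let one model count for more if it validated better.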
The model summary shows that the modeled cost of system design capacity across the combined sewer system is roughly $132,000 per greened acre in the kitchen sink model and roughly $127,000 in the leanest model. Through the use of the stepwise function, the model was reworked to find the best fit with the fewest variables, resulting in the dropping of 14 variables from the initial kitchen sink model. With the exception of the land cover cluster, all of the retained variables have a positive coefficient.
Of the variables associated with SMP types, the least costly is stormwater trees and the most expensive is swales. While swales may be the most expensive on a per-unit basis, breaking costs down into a comparable unit across SMP types could produce different results; for example, a square foot of tree trench may cost more than a square foot of swale.
Except for swales, the SMPs associated with a more hybridized green and grey approach, namely bumpouts and tree trenches, are the most expensive. This is consistent with expectations, since these projects require considerably more work and materials to construct.
From the table, the R squared of the leanest model indicates that about 82% of the variance in the dependent variable, cost, is explained by the selected independent variables. Simply put, this model does a good job of fitting cost.
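The adjusted R squared values reported alongside R squared penalize a model for its number of predictors, which is why the leanest model scores slightly higher on that measure despite a lower raw R squared. A short sketch using the standard adjustment formula and the figures from the single multivariate model table (the function name is an assumption):

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R2, observation count, and predictor count (F statistic numerator df)
kitchen_sink = adjusted_r2(0.838, 149, 21)
leanest = adjusted_r2(0.825, 149, 8)
print(round(kitchen_sink, 3), round(leanest, 3))
```

The results agree with the Adjusted R2 row of the table to within rounding of the inputs.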
Green infrastructure was generally cheapest in land cover cluster 3, with a predicted cost of about $131,500 per greened acre. This is approximately $30,000 cheaper than cluster 2 and $50,000 cheaper than cluster 1.
These results, based on the land cover clusters, are what could be expected from our assumptions. The cluster analysis shows cluster 1 having a much higher density of buildings than the others, meaning it is a highly impervious cluster. On the map of the clusters, its watersheds are in the Center City and South Philadelphia areas, which are higher-density areas of the city. Cluster 2 is the second most impervious, and cluster 3 is the most pervious, with 69% of its watersheds’ total area covered by pervious land cover (trees and grass). The most pervious watersheds are those at the furthest extents from Center City Philadelphia, like Wynnefield and Germantown. These areas have larger green spaces and greater tree cover as well as less dense development.
In evaluating the R squared results for each of the clusters, clusters 1 and 3 have slightly higher values than cluster 2, with cluster 3 being the highest. With an adjusted R squared of 0.904, the cluster 3 model does a good job of explaining cost from its independent variables. The lower R squared for cluster 2 could be a result of its more average mix of land cover across all categories: because it is neither highly pervious nor highly impervious, the relationship between the variables and their environment may be weaker.
Statistically significant variables differ across the cluster models, but each has only about 3 to 4 worth elaborating. Cluster 1 identifies tree trenches and swales as significant. Swales could carry such high costs in this cluster, about $645,000, because the square footage available for these projects is limited and high-value land in these areas is too precious to devote entirely to low-tech green infrastructure like a swale. Cluster 2 shows swales as even more statistically significant than in cluster 1, but the cost there, about $359,000, is nearly half of what it would be in cluster 1. Finally, cluster 3 shows stormwater trees and tree trenches as significant independent variables. Their significance may reflect the larger number of these SMPs in the dataset. It should be noted that tree trenches were statistically significant in both cluster 1 and cluster 3. Cluster 3’s predicted cost for this SMP type is about $4,000 cheaper than in cluster 1. Given the land cover composition of the two clusters, this could be due to the lower construction costs associated with working on greenfield and lower-density sites, as opposed to the high-density, low-open-space sites in cluster 1.
Green infrastructure is the most cost effective in land use cluster 3, costing around $120,000 per additional greened acre, compared with about $170,000 in cluster 1 and $185,000 in cluster 2.
Based on the cluster analysis results, cluster 3 has the highest share of residential land use, at nearly 50%. Cluster 1 has the largest share of transportation land use, and cluster 2 has more vacant land and industrial space. These compositions help explain the model results for each cluster. For instance, since cluster 3 has the lowest cost per greened acre, one could argue that SMPs increase in cost as density and impervious surface increase.
Turning to the R squared values, the cluster 3 model has the best fit, explaining over 93% of the variance in cost (adjusted R squared of 0.938). The models for clusters 1 and 2 were slightly lower, at 0.81 and 0.84 respectively; though lower, they show that all three cluster models fit cost well from their independent variables. Cluster 3 having the highest R squared could be a result of that model having the most observations, at 77, as opposed to 21 and 51 for clusters 1 and 2 respectively.
Cluster 3 has the most independent variables with high statistical significance, 8 of the original 22. This model shows a few interesting findings. One is that swales have a negative coefficient, meaning that each additional swale is associated with a decrease in cost of about $77,000. Considering the land use associated with cluster 3, this could mean that significant cost savings for swale projects can be achieved in watersheds with high residential land use. The lowest positive coefficient is for stormwater trees and the highest is for bumpouts, at roughly $17,000 and $79,000 respectively. As discussed previously, this is most likely due to the lack of space, increased land costs, construction costs, and labor associated with projects like these in high-density areas.
Cluster 2 had three statistically significant SMP variables: swales, bumpouts, and planters. All of them are more expensive than the greener SMP types in the other models. It should also be noted that aside from the greened acres variable, cluster 1 had no other variables significant at the 5% level. Considering that the cluster 1 model had only about a quarter as many observations as cluster 3 (21 in total), this could be due to a lack of data points.
10-fold cross-validation was performed on each of the models. Each model was trained on a training set comprising a random 80% of the data and then tested on the remaining 20%. The mean absolute percent error (MAPE) was calculated for each model from all of the predicted versus observed values (absolute error).
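A minimal sketch of the two pieces involved, MAPE and a k-fold partition of the rows, assuming pure Python and a fixed random seed. The function names and sample values are hypothetical, and the model fitting step is left out. Note that `k` controls the held-out share: 10 folds hold out roughly a tenth of the rows at a time, while 5 folds would give roughly an 80/20 split.

```python
import random

def mape(observed, predicted):
    """Mean absolute percent error, expressed as a fraction."""
    errors = [abs(o - p) / abs(o) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle the row indices and deal them into k folds."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]

folds = k_fold_indices(149, k=10)
for held_out in folds:
    # Rows from every other fold form the training set for this round
    train = [i for fold in folds if fold is not held_out for i in fold]
    # A model would be fit on `train` and scored on `held_out` here
    assert len(train) + len(held_out) == 149

# Hypothetical observed and predicted project costs
example_mape = mape([100_000.0, 200_000.0], [110_000.0, 180_000.0])
print(example_mape)
```

Pooling the held-out predictions from every round before computing MAPE gives one error figure per model, which is how the summary values in this section are expressed.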
In order to test the model on new data, the models were also run using future planned projects. These are projects that are either currently in construction or have been bid. This is the first time this data has been introduced to the model, so this is really the best way to test the models’ performance. This data was also obtained from the 2018 NPDES report. It should be noted, however, that the average cost per greened acre of planned projects ($323,000) is far greater than that of constructed projects ($271,000), so it was expected that the model would systematically under-predict these costs. It is unknown why future costs are so much higher than past costs. One interpretation is that stormwater projects are becoming more expensive because the ‘lowest-hanging fruit’, or most cost-effective, projects have been completed first. However, this is speculation.
The fitted versus observed values were plotted below to visualize each model’s predictions. Notice how the points become clustered closer around the identity line (slope of 1) as the MAPE decreases.
\(\color{orange}{\text{The orange points are the k-folds cross validation predictions.}}\) \(\color{blue}{\text{The blue points are the future planned project predictions.}}\)
Click through the 3 tabs below to view the cross validation results for each of the 3 models.
The table below summarizes the MAPE of each model on the constructed projects (k-fold cross-validation) and on the future planned projects. Interestingly, the single multivariate model performed poorly in cross-validation but performed best on the planned projects. The blended ensemble ranked second on constructed projects and third on planned projects, and it has the lowest MAPE averaged across the two data sets (about 0.38). This suggests that the blended ensemble is the best cost model to use, since it is the most generalizable across both data sets.
| Data | Single Multivariate | Land Use Ensemble | Land Cover Ensemble | Blended Ensemble | Mean Cost/Greened Acre ($) |
|---|---|---|---|---|---|
| Constructed Projects | 0.493 | 0.510 | 0.431 | 0.443 | 271,137 |
| Planned Projects | 0.294 | 0.300 | 0.379 | 0.320 | 323,547 |
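One way to compare the models across both data sets is to rank them within each set and then average their MAPE. The sketch below uses the table values rounded to three decimals. The unweighted average is an assumption chosen for simplicity, not a method stated in the study.

```python
# MAPE values from the table, rounded to three decimals.
mape_by_data_set = {
    "Constructed Projects": {
        "Single Multivariate": 0.493,
        "Land Use Ensemble": 0.510,
        "Land Cover Ensemble": 0.431,
        "Blended Ensemble": 0.443,
    },
    "Planned Projects": {
        "Single Multivariate": 0.294,
        "Land Use Ensemble": 0.300,
        "Land Cover Ensemble": 0.379,
        "Blended Ensemble": 0.320,
    },
}

def rank_models(scores):
    """Order model names from lowest (best) to highest (worst) MAPE."""
    return sorted(scores, key=scores.get)

rankings = {name: rank_models(scores) for name, scores in mape_by_data_set.items()}

# Unweighted mean MAPE across the two data sets (an illustrative summary choice).
models = list(mape_by_data_set["Constructed Projects"])
mean_mape = {
    m: sum(scores[m] for scores in mape_by_data_set.values()) / len(mape_by_data_set)
    for m in models
}
best_overall = min(mean_mape, key=mean_mape.get)

for name, order in rankings.items():
    print(name, "->", ", ".join(order))
print("Lowest mean MAPE:", best_overall)
```

No single model wins on both data sets, so the choice depends on how the two sets are weighed. An unweighted average favors the model that is never far from the best.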
The results above suggest that the geographic clustering and ensemble modeling approach is better suited than a single multivariate modeling approach to such a small data set. The approach is more generalizable to new data and less prone to over-fitting the training data because its several levels of ensembling reduce the potential for any one independent variable to dominate the others. This generalizability is important if the model is to be used to estimate future project costs.
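The intuition behind ensembling can be shown with a simple prediction average. This sketch is an assumption for illustration only: the study's actual blending procedure is not described here, and the model names, equal weights, and dollar values below are hypothetical.

```python
def blend(predictions_by_model, weights=None):
    """Weighted average of per-model predictions; equal weights if none are given."""
    names = list(predictions_by_model)
    if weights is None:
        weights = {name: 1 / len(names) for name in names}
    n_projects = len(predictions_by_model[names[0]])
    return [
        sum(weights[name] * predictions_by_model[name][i] for name in names)
        for i in range(n_projects)
    ]

# Hypothetical predictions (dollars) for three projects from two component models.
predictions = {
    "land_use_model": [240_000, 510_000, 330_000],
    "land_cover_model": [300_000, 450_000, 390_000],
}
blended = blend(predictions)
print(blended)
```

When the component models err in different directions, their errors partly cancel in the average, which is one reason a blended prediction can be more stable on new data than any single model.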
Another advantage of the cluster and ensemble approach is that it takes into account cost variations due to geography, land use, and the built environment. While the single multivariate model includes land cover and land use as independent variables, those variables are not statistically significant because their predictive power is overshadowed by the greened acres variable, which is the most significant in the model. Project cost estimation is much more nuanced than a single cost-per-unit figure applied across the entire city, and a predictive cost model should be as well.
The green infrastructure planning team could use this modeling approach to obtain immediate cost estimates at the planning level. Decisions made this early in the planning process would later translate into more cost-effective infrastructure.
While the model results are significant, there are ways the data could be refined to improve them, with implications for the future of cost prediction. One improvement would be to break out the SMPs by size so that the model's cost estimates for each SMP type are directly comparable. Currently, the projects are priced per unit using the Water Department's own metrics rather than a standardized measure.
Adding more projects to the model as they become available should increase its accuracy and could reveal new relationships, particularly additional statistically significant independent variables. Additionally, using socioeconomic data within these watersheds could allow for a better understanding of cost implications for different demographic groups by area. The resulting takeaways would not only be about the bottom line but could also help the Philadelphia Water Department identify areas of the city where projects would be both cost-effective and have a positive social impact.
For future exploration, we would like to investigate which types of infrastructure, whether green or gray, are the most cost-effective in a neighborhood context. Additionally, identifying the most cost-effective SMP types for each watershed would help in the infrastructure planning process. This exploration would require additional data, including individual SMP sizes. As the data stands now, each SMP type is counted as a single unit (1 bioswale is the same unit as 1 stormwater tree). Knowing the variation in SMP sizes within each project would allow a better understanding of the cost variation between SMP types, whereas the model is currently only able to differentiate cost variations in SMP types between watersheds.